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Abstract 

We  report  on  our  experience  with  building  a 
statistical  MT  system  from  scratch,  includ¬ 
ing  the  creation  of  a  small  parallel  Tamil- 
English  corpus,  and  the  results  of  a  task- 
based  pilot  evaluation  of  statistical  MT  sys¬ 
tems  trained  on  sets  of  ca.  1300  and  ca. 

5000  parallel  sentences  of  Tamil  and  English 
data.  Our  results  show  that  even  with  appar¬ 
ently  incomprehensible  system  output,  hu¬ 
mans  without  any  knowledge  of  Tamil  can 
achieve  performance  rates  as  high  as  86% 
accuracy  for  topic  identification,  93%  recall 
for  document  retrieval,  and  64%  recall  on 
question  answering  (plus  an  additional  14% 
partially  correct  answers). 

1  Introduction 

Crises  and  disasters  frequently  attract  international  at¬ 
tention  to  regions  of  the  world  that  have  previously 
been  largely  ignored  by  the  international  community. 
While  it  is  possible  to  stock  up  on  emergency  relief 
supplies  and,  for  the  worst  case,  weapons,  regardless 
of  where  exactly  they  are  eventually  going  to  be  used, 
this  cannot  be  done  with  multilingual  information  pro¬ 
cessing  technology.  This  technology  will  often  have  to 
be  developed  after  the  fact  in  a  quick  response  to  the 
given  situation.  Multilingual  data  resources  for  sta¬ 
tistical  approaches,  such  as  parallel  corpora,  may  not 
always  be  available. 

In  the  fall  of  2000,  we  decided  to  put  the  current 
state  of  the  art  to  the  test  with  respect  to  the  rapid  con¬ 
struction  of  a  machine  translation  system  from  scratch. 

Within  one  month,  we  would 

•  hire  translators; 

•  translate  as  much  text  as  possible;  and 

•  train  a  statistical  MT  system  on  the  data  thus  cre¬ 
ated. 

The  language  of  choice  was  Tamil,  which  is  spoken 
in  Sri  Lanka  and  in  the  southern  part  of  India.  Tamil  is 
a  head-last  language  with  a  very  rich  morphology  and 
therefore  quite  different  from  English. 


2  Data  Collection  and  Preparation 

2.1  Obtaining  Tamil  Data 

Tamil  data  is  not  very  difficult  to  find  on  the  web. 
There  are  several  Tamil  newspapers  and  magazines 
with  online  editions,  and  the  large  international  Tamil 
community  fosters  the  use  of  the  Internet  for  the  dis¬ 
semination  of  information.  After  initial  investigation 
of  several  web  sites  we  decided  to  download  our  exper¬ 
imental  corpus  from  www .  tamilnet .  com,  a  news 
site  that  provides  local  news  on  Sri  Lanka  in  both 
Tamil  and  English.  The  Tamil  and  English  news  texts 
on  this  site  do  not  seem  to  be  translations  of  each  other. 
The  availability  of  a  fairly  large  in-domain  corpus  of 
local  news  on  Sri  Lanka  in  English  (over  2  million 
words)  allowed  us  to  train  an  in-domain  English  lan¬ 
guage  model  of  Sri  Lankan  news. 

2.2  Encoding  and  Tokenization 

Tamil  is  written  in  a  phonematic,  non-Latin  script. 
Several  encoding  schemes  exist  in  parallel.  Even 
though  the  Unicode  standard  includes  a  set  of  glyphs 
for  Tamil,  it  is  not  widely  used  in  practice.  Most  web 
sites  that  offer  Tamil  language  material  assume  Latin- 1 
encoding  and  rely  on  special  true  type  fonts,  which  of¬ 
ten  are  also  offered  for  free  download  at  those  sites. 
Tamil  text  is  therefore  fairly  easy  to  identify  on  web 
sites  via  the /ace  attribute  of  the  HTML /ont  tag.  All 
that  is  necessary  is  a  list  of  Tamil  font  names  used  by 
the  different  sites,  and  knowledge  about  which  encod¬ 
ings  these  fonts  implement.  While  we  could  restrict 
ourselves  to  one  data  source  and  encoding  for  our  ex¬ 
periment,  any  large-scale  system  would  have  to  take 
this  into  account.  In  order  to  make  the  source  text 
recognizable  to  humans  who  have  no  knowledge  of 
Tamil,  we  decided  to  work  with  transliterated  text'. 

2.3  Translating  the  Corpus 

Originally  we  hoped  to  be  able  to  create  a  parallel  cor¬ 
pus  of  about  100,000  words  on  the  Tamil  side  within 
one  month,  using  several  translators.  Professional 

'Translations,  however,  were  produced  from  the  original 
Tamil. 
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translation  services  in  the  US  currently  charge  rates  of 
about  30  cents  per  English  word  for  translations  from 
Tamil  into  English.  Given  that  the  English  transla¬ 
tion  of  a  Tamil  text  usually  contains  about  1.2  times 
as  many  words  as  the  Tamil  original,  the  translation  of 
a  corpus  of  100,000  Tamil  words  would  cost  approxi¬ 
mately  USD  36,000.  This  was  far  beyond  our  budget. 

In  India,  by  comparison,  raw  translations  may  cost 
as  little  as  one  cent  per  Tamil  word^.  However,  out¬ 
sourcing  the  translation  work  abroad  was  not  feasible 
for  us,  since  we  had  neither  the  administrative  infras¬ 
tructure  nor  the  time  to  manage  such  an  effort.  Also, 
working  with  partners  so  remote  would  have  made  it 
very  difficult  to  communicate  our  exact  needs  and  to 
implement  proper  quality  control. 

We  finally  decided  to  hire  as  translators  four  enter¬ 
ing  and  second-year  graduate  students  in  the  depart¬ 
ment  of  engineering  whose  native  language  is  Tamil 
and  who  had  responded  to  an  ad  posted  in  the  local 
mailing  list  for  students  from  India. 

In  order  to  manage  the  corpus  translation  process, 
we  set  up  a  web  interface  through  which  the  transla¬ 
tors  could  retrieve  source  texts  and  upload  their  trans¬ 
lations,  post-editors  could  post-edit  text  online,  the 
project  progress  could  be  monitored,  and  all  incoming 
text  was  available  to  other  project  members  as  soon  as 
it  was  submitted. 

We  originally  assumed  that  translators  would  be 
able  to  translate  about  500  words  per  hour  if  we  were 
content  with  raw  translations  and  hardly  any  format¬ 
ting,  and  if  we  allowed  them  to  skip  difficult  words 
or  sentences.  This  estimate  was  based  on  an  inter¬ 
nal  evaluation,  in  which  multilingual  members  of  our 
group  translated  sample  documents  from  their  native 
language  (Arabic,  German,  Romanian)  into  English 
and  kept  track  of  the  time  they  spent  on  this. 

It  turned  out  that  our  expectations  were  very  much 
exaggerated  with  both  respect  to  translation  speed  and 
the  quality  of  translation.  The  actual  translation  speed 
for  Tamil  varied  between  156  and  247  words  per  hour 
with  an  average  of  170  words  per  hour.  In  139  hours  of 
reported  translation  time  (over  a  period  of  eventually  6 
weeks),  about  24,000  words  /  1,300  sentences  of  Tamil 
text  were  translated,  at  an  effective  cost  of  ca.  10.8 
cents  per  Tamil  word  (translators’  compensation  plus 
administrative  overhead).  This  figure  does  not  include 
the  effort  for  manually  post-editing  the  translations  by 
a  native  speaker  of  English  (12-16  person  hours). 

The  overall  organization  of  the  project  (source  data 
retrieval,  hiring  and  management  of  the  translators, 
design  and  implementation  of  the  web  interface  for 
managing  the  project  via  the  Internet,  development  of 
transliterator  and  stemmer,  etc.)  required  an  additional 

^Personal  communications  with  Thomas  Malten,  University 
of  Cologne. 


estimated  2.5  person  months.  However,  a  good  part  of 
this  effort  led  to  resources  that  can  also  be  used  for 
other  purposes. 

2.4  Lessons  Learned  for  Future  Projects 

If  we  were  to  give  advice  for  future,  similar  projects, 
we  would  emphasize  and  recommend  the  following: 

2.4.1  Good  translators  are  not  easy  to  find 

It  is  difficult  to  find  good  translators  for  a  short¬ 
term  commitment.  Unless  one  is  willing  to  pay  a  pre¬ 
mium  price,  it  is  unlikely  that  one  will  find  profes¬ 
sional  translators  who  are  willing  to  commit  much  of 
their  time  for  a  limited  period  of  time  and  on  short  no¬ 
tice. 

2.4.2  Make  the  translation  job  attractive 

As  foreign  students,  our  translators  would  each  have 
been  allowed  to  work  up  to  twenty  hours  per  week. 
None  of  them  did,  because  the  work  was  frustrating 
and  boring,  and  because  they  found  more  attractive, 
long  term  employment  on  campus.  Our  translators’ 
frustration  may  have  been  fostered  by  several  factors: 

•  the  differences  between  Sri  Lankan  Tamil  (the  va¬ 
riety  used  in  our  corpus)  and  the  Tamil  spoken  in 
Southern  India  (the  native  language  of  our  trans¬ 
lators),  which  made  translating,  according  to  our 
translators,  very  difficult; 

•  the  lack  of  translation  experience  of  our  transla¬ 
tors;  and 

•  our  high  expectations.  We  originally  told  our 
translators  that  since  they  were  not  working  on 
site,  we  would  expect  the  translation  of  500 
words  per  hour  reported.  When  we  later  switched 
to  hourly  pay  regardless  of  translation  volume, 
the  translation  volume  picked  up  slightly. 

2.4.3  Be  prepared  to  post-edit 

In  professional  translating,  translators  typically 
translate  into  their  native  language  only.  One  may  not 
be  able  to  find  translators  with  English  as  their  na¬ 
tive  language  for  low  density  or  “small”  languages, 
so  it  may  be  necessary  to  have  the  translations  post- 
edited  by  people  with  greater  language  proficiency  in 
English. 

2.4.4  Have  translators  and  post-editors  work  on 
site 

It  is  better  to  have  translators  and  post-editors  work 
on  site  and  ideally  as  teams,  so  that  they  can  resolve 
ambiguities  and  misunderstandings  immediately  with¬ 
out  the  delays  of  communicating  indirectly,  be  it  by 
email  or  other  means.  A  post-editor  who  does  not 
know  the  source  language  may  misinterpret  the  trans¬ 
lator,  as  the  following  case  from  our  corpus  illustrates: 


English  Tamil 

Figure  1:  Text  coverage  on  previously  unseen  text  for  English  (left)  and  Tamil  (right).  The  upper  line  in  each 
graph  shows  the  coverage  by  tokens  that  have  been  seen  at  least  once,  the  lower  line  shows  the  coverage  by  tokens 
that  have  been  seen  at  least  5  times.  The  error  bars  indicate  standard  deviation. 


Raw  translation:  Information  about  the  schools  in 
which  people  who  migrated  to  Kudaanadu  are  staying 
is  being  gathered. 

Post-edited  version:  Information  about  the  schools  in 
(sic!)  which  immigrants  to  Kudaanadu  are  attending 
is  being  gathered. 

In  this  case,  the  post-editor  clearly  misinterpreted  the 
translator.  What  the  translator  meant  to  and  actually 
did  say  is  that  information  was  being  gathered  about 
the  schools  in  which  migrants/war  refugees  who  had 
arrived  in  Kudaanadu  had  found  shelter.  However, 
the  post-editor  interpreted  the  phrase  people  who  mi¬ 
grated  to  Kudaana  as  describing  immigrants  and  as¬ 
sumed  that  information  was  being  gathered  about  their 
education  rather  than  their  housing. 

3  Evaluation  Experiments 

3.1  A  Priori  Considerations 

The  richer  the  morphology  of  a  language  is,  the  greater 
is  the  total  number  of  distinct  word  forms  that  a  given 
corpus  consists  of,  and  the  smaller  is  the  probability 
that  a  certain  word  form  actually  occurs  in  any  given 
text  segment.  Figure  1  shows  the  percentage  of  word 
forms  in  unseen  text  that  have  occurred  in  previously 
seen  text  as  a  function  of  the  amount  of  previously 
seen  text.  The  graph  on  the  left  shows  the  curves  for 
English,  the  one  on  the  right  the  curves  for  Sri  Lankan 
Tamil.  The  graphs  show  the  averages  of  100  runs  on 
different  text  fragments;  the  error  bars  indicate  stan¬ 
dard  deviation. 

The  numbers  were  computed  in  the  following  man¬ 
ner:  A  corpus  of  120,000  tokens  was  split  into  seg¬ 


ments  of  1000  tokens  each.  For  each  segment  nk, 
we  computed  how  many  of  the  tokens  had  been  pre¬ 
viously  seen  in  the  segments  rii  . . .  nu-i.  The  upper 
line  in  the  graphs  shows  the  percentage  of  tokens  in  nu 
that  had  occurred  at  least  once  before  in  the  segments 
rii  . . .  rik-i,  the  lower  line  shows  the  percentage  of  to¬ 
kens  that  had  been  seen  at  least  five  times  before. 

For  the  purpose  of  statistical  NLP,  it  seems  reason¬ 
able  to  assume  that  the  lower  curve  gives  a  better  indi¬ 
cation  of  how  many  percent  of  previously  unseen  text 
we  can  expect  to  be  “known”  to  a  statistical  model 
trained  on  a  corpus  of  m  tokens. 

At  a  corpus  size  of  24,000  tokens,  which  is  approx¬ 
imately  the  size  of  the  parallel  corpus  we  were  able  to 
create  during  our  experiment,  about  28%  of  all  word 
forms  in  previously  unseen  Sri  Lankan  Tamil  text  can¬ 
not  be  found  in  the  corpus,  and  50%  have  been  seen 
less  than  5  times.  In  other  words,  if  we  train  a  system 
on  this  data,  we  can  expect  it  to  stumble  over  every 
other  word!  At  a  corpus  size  of  100,000  tokens,  the 
numbers  are  17%  and  33%. 

For  English,  the  numbers  are  9%/23%  for  a  corpus 
of  24K  tokens  and  0%/8%  for  a  corpus  of  lOOK  to¬ 
kens. 

In  order  to  boost  the  text  coverage  we  built  a  simple 
text  stemmer  for  Tamil,  based  on  the  Tamil  inflection 
tables  in  Steever  (1990)  and  some  additional  inspec¬ 
tion  of  our  parallel  corpus.  The  stemmer  uses  regular 
expression  matching  to  cut  off  inflectional  endings  and 
introduce  some  extra  tokens  for  negation  and  certain 
case  markings  (such  as  locative  and  genitive),  which 
are  all  marked  morphologically  in  Tamil.  It  should  be 
noted  that  the  stemmer  is  far  from  perfect  and  was  only 
intended  to  be  an  interim  solution.  The  performance 
increases  are  displayed  in  Figure  2.  For  a  corpus  size 


Figure  2:  Text  coverage  increase  by  stemming  for 
Tamil.  The  solid  lines  indicate  text  coverage  for  un¬ 
stemmed  data  (seen  at  least  once  and  at  least  five  times, 
respectively),  the  dashed  lines  the  text  coverage  for 
stemmed  data. 

of  24K  tokens,  the  percentages  of  unknown  items  drop 
to  19%  (from  28%;  never  seen  before)  and  36%  (from 
50%;  seen  less  than  5  times).  For  a  training  corpus 
of  lOOK  tokens,  the  numbers  are  12%  and  23%  (from 
17%/33%). 

3.2  Task-Based  Pilot  Evaluation 

Given  these  numbers,  it  is  obvious  that  one  cannot  ex¬ 
pect  much  performance  from  a  system  that  relies  on 
models  trained  on  only  24K  tokens  of  data.  As  a  mat¬ 
ter  of  fact,  it  is  close  to  impossible  to  make  any  sense 
whatsoever  of  the  output  of  such  a  system  (cf.  Fig.  3). 

To  get  an  estimate  of  the  performance  with  more 
training  data,  we  augmented  our  corpus  with  a  paral¬ 
lel  corpus  of  international  news  texts  in  Southern  In¬ 
dian  Tamil  which  was  made  available  to  us  by  Fred 
Gey  of  the  University  of  California  at  Berkeley  (hence¬ 
forth;  Berkeley  corpus).  This  corpus  contains  ca. 
3,800  sentence  pairs  with  75,800  Tamil  tokens  after 
stemming  (before  stemming:  60,000;  the  difference 
is  due  to  the  introduction  of  additional  markers  dur¬ 
ing  stemming).  Some  of  the  parallel  data  was  with¬ 
held  for  system  evaluation;  the  augmented  training 
corpus  (Berkeley  and  TamilNet  corpus;  short  B+TN) 
had  a  size  of  85K  tokens  on  the  Tamil  side.  The  aug¬ 
mented  training  corpus  had  a  text  coverage  of  81% 
(seen  at  least  once;  75%  without  augmentation),  and 
67%  (seen  at  least  5  times;  60%  without  augmenta¬ 
tion),  respectively,  for  Sri  Lankan  Tamil.  We  trained 
IBM  Translation  Model  4  (Brown  et  ak,  1993)  both  on 
our  corpus  alone  and  on  the  augmented  corpus,  using 
the  Egypt  toolkit  (Knight  et  ak,  1999;  Al-Onaizan  et 
ak,  1999),  and  then  translated  a  number  of  texts  us¬ 
ing  different  translation  models  and  different  transfer 


methods,  namely  glossing  (replacing  each  Tamil  word 
by  the  most  likely  candidate  from  the  translation  tables 
created  with  the  EGYPT  toolkit)  and  Model  4  decoding 
(Brown  et  ak,  1995;  Germann  et  ak,  2001). 

Eigure  3  shows  the  output  of  the  different  systems 
in  comparison  with  the  human  translation. 

We  then  conducted  the  following  experiments. 

3.2.1  Document  Classification  Task 

Seven  human  subjects  without  any  knowledge  of 
Tamil  were  given  translations  of  a  set  of  15  texts  (all 
from  the  Berkeley  corpus)  and  asked  to  categorize 
them  according  to  the  following  topic  hierarchy: 

•  News  about  Sri  Lanka 

•  Reports  about  clashes  between  the  Sri 
Lankan  army  and  the  Liberation  Tigers 

•  Sri  Lankan  security-related  news  (arrests, 
arms  deals,  etc.) 

•  Sri  Lankan  political  news  (strikes,  transport, 
telecom) 

•  concerns  Sri  Lanka  but  doesn’t  fit  any  of  the 
above 

•  News  about  Pakistan/India 

•  Nuclear  tests  in  Pakistan  and  India,  includ¬ 
ing  their  aftermath  (international  reactions, 
etc.) 

•  Corruption  investigation  against  Benazir 
Buto 

•  News  about  Pakistan/India  but  none  of  the 
above 

•  International  news 

•  Disasters,  accidents 

•  Nelson  Mandela’s  birthday 

•  Other  international  news 

•  Impossible  to  tell 

Except  for  one  duplicate  set,  each  subject  received  a 
different  set  of  translations.  The  sets  differed  in  train¬ 
ing  parameters  and  the  translation  method  used.  Ta¬ 
ble  1  shows  the  results  of  this  evaluation.  The  dif¬ 
ference  between  the  subjects  5a  and  5b,  who  received 
the  same  set  of  translations,  suggests  that  the  individ¬ 
ual  classifiers’  accuracy  influences  the  results  so  much 
as  to  blur  the  effect  of  the  other  parameters.  There 
seems  to  be  a  tendency  for  glossing  to  work  better  than 
Model  4  decoding.  Glossing,  in  our  system,  is  a  simple 
base  line  algorithm  that  provides  the  most  likely  word 
translation  for  each  word  of  input.  Translation  can¬ 
didates  and  their  probabilities  are  retrieved  from  the 
translation  table,  which  is  part  of  the  translation  model 
trained  on  the  parallel  corpus. 

The  document  classification  test  is  foremost  and 
above  all  a  measure  of  the  quality  of  the  translation  ta¬ 
ble  for  frequently  occurring  words.  In  practice,  actual 


Small  Corpus 

24K  tokens  (Tamil)  of  training  data 

Augmented  Corpus 

85K  tokens  (Tamil)  of  training  data 

Human  translation 

Gloss 

Model  4  decoding" 

Gloss 

Model  4  decoding 

Government  -  United 
National  Party 
meeting  will  not  take 
place  tomorrow. 

vanni  united  national 
party  kalloya 
tomorrow 
naTaipeRamaaTTa 

children  united 
national  party 
tomorrow  . 
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Figure  3:  Sample  output  of  various  systems 


Table  1;  Results  of  the  Document  Classification  Task.  Test  subjects  were  asked  to  classify  the  translations  of  15 
documents  into  4  major  and  1 1  minor  categories. 


input 

pegging"? 

transfer 

correct 

partially 

correct* 

incorrect 

1 

raw 

no 

M4  decoding" 

7 

4 

4 

2 

stemmed 

yes 

M4  decoding 

8 

3 

4 

3 

stemmed 

no 

M4  decoding 

13 

2 

0 

4 

raw 

no 

gloss 

13 

1 

1 

5a 

stemmed 

yes 

gloss 

8 

3 

4 

5b 

stemmed 

yes 

gloss 

12 

2 

1 

6 

stemmed 

no 

gloss 

11 

2 

2 

“pegging  causes  the  training  algorithm  to  consider  a  larger  search  space 
^correct  top  level  category  but  incorrect  sub-category 

“^translation  by  maximizing  the  IBM  Model  4  probability  of  the  source/translation  pair  (Brown 
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classification  might  be  performed  by  automatic  pro¬ 
cedures  rather  than  humans.  If  we  dare  to  accept  the 
top  performances  of  our  human  subjects  as  the  ten¬ 
tative  upper  bound  of  what  can  be  achieved  with  the 
current  system  using  a  translation  model  trained  on 
85K  tokens  of  Tamil  text  and  the  corresponding  En¬ 
glish  translations,  we  can  conclude  that  the  classifica¬ 
tion  accuracy  can  exceed  86%  (13/15)  for  fine-grained 
classification  and  reach  100%  for  coarse-grained  clas¬ 
sification.  However,  given  the  extremely  small  sample 
size  in  this  evaluation,  the  evidence  should  not  be  con¬ 
sidered  conclusive. 

3.2.2  Document  Retrieval  Task 

The  document  retrieval  task  and  the  question  an¬ 
swering  task  (see  below)  were  combined  into  one  task. 
The  subjects  received  14  texts  (from  the  TamilNet  cor¬ 
pus)  and  15  lead  questions  plus  13  additional  follow¬ 
up  questions.  Their  task  was  to  identify  the  docu- 
ment(s)  that  contain(s)  the  answer  to  the  question  and 
to  answer  the  questions  asked.  Typical  lead  questions 
were  questions  such  as  What  is  the  security  situation 
in  Trinconmalee?,  or  Who  is  S.  Thivakarasa?;  typi¬ 
cal  follow-up  questions  were  Who  is  in  control  of  the 
situation?.  What  happened  to  him/her?,  or  How  did 
( other)  people  react  to  what  happened? .  As  in  the  pre¬ 
vious  experiment,  each  subject  received  the  output  of 
a  different  system. 

Table  2  shows  the  result  of  the  document  retrieval 
task.  Again,  the  sample  size  was  too  small  to  draw 
any  final  conclusions,  but  our  results  seem  to  suggest 
the  following.  Firstly,  the  test  subject  in  the  group 
dealing  with  output  of  systems  trained  on  the  bigger 
training  set  tend  to  perform  better  than  the  ones  deal¬ 
ing  with  the  results  of  training  on  less  data.  This  sug¬ 
gests  that  the  jump  from  24K  to  85K  tokens  of  train¬ 
ing  data  might  improve  system  performance  in  a  sig¬ 


nificant  manner.  We  were  surprised  that  even  with 
the  poor  translation  performance  of  our  system,  re¬ 
calls  as  high  as  93%  at  a  precision  of  88%  could  be 
achieved.  Secondly,  the  data  shows  that  gaps  are  not 
randomly  distributed  over  the  data,  but  that  some  ques¬ 
tions  clearly  seem  to  have  been  more  difficult  than  oth¬ 
ers.  One  of  the  particular  difficult  aspects  of  the  task 
was  the  spelling  of  names.  Question  11,  for  exam¬ 
ple,  asked  What  happened  to  Chandra  Kumar  Abayas- 
ingh?.  In  the  translations,  however,  it  was  rendered  in 
simple  transliteration:  cantirakumaara  apayacingka. 
It  requires  a  considerable  degree  of  tenacity  and  imag¬ 
ination  to  find  this  connection. 

3.2.3  Question  Answering  Task 

In  order  to  measure  the  performance  in  the  ques¬ 
tion  answering  part  of  this  evaluation,  we  considered 
only  questions  relevant  to  the  documents  that  the  test 
subjects  had  identified  correctly.  Because  of  the  diffi¬ 
culty  of  the  task,  we  were  lenient  to  some  degree  in  the 
evaluation.  For  example,  if  the  correct  answer  was  the 
former  president  of  the  teacher’s  union  and  the  answer 
given  was  an  official  of  the  teacher’s  union,  we  still 
counted  this  as  “close  enough”  and  therefore  correct. 
In  addition,  we  also  allowed  partially  correct  answers, 
that  is,  answers  that  went  into  the  right  direction  but 
were  not  quite  correct.  For  example,  if  the  correct  an¬ 
swer  was  The  army  imposed  a  curfew  on  fishing,  we 
counted  the  answer  the  army  is  stopping  fishing  boats 
as  partially  correct.  All  in  all,  it  was  very  difficult  to 
evaluate  this  section  of  the  task,  because  it  was  often 
close  to  impossible  to  determine  whether  the  answer 
was  just  an  educated  guess  or  actually  based  on  the 
text.  There  were  some  cases  where  answers  were  par¬ 
tially  or  even  fully  correct  even  though  the  correct  doc¬ 
ument  had  not  been  identified.  In  retrospect  we  con¬ 
clude  that  it  would  have  been  better  to  have  the  test 


Table  2:  Recall  and  precision  on  the  document  retrieval  task.  Test  subjects  were  asked  to  identify  the  document(s) 
containing  the  answers  to  15  lead  questions.  Black  dots  indicate  successful  identification  of  at  least  one  document 
containing  the  answer. 
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subjects  mark  up  those  text  passages  in  the  text  that 
justify  their  answers. 

Again,  the  data  suggests  that  the  difference  in  train¬ 
ing  corpus  size  does  affect  the  amount  of  information 
that  is  available  from  the  system  output.  Subjects  us¬ 
ing  output  of  a  system  based  on  a  translation  model 
that  was  trained  on  only  the  TamilNet  data  tend  to  per¬ 
form  worse  than  subjects  using  output  from  a  system 
based  on  a  translation  model  trained  on  the  larger  cor¬ 
pus.  The  poor  performance  on  test  set  No.  6  may 
suggest  that  for  this  task  and  at  this  level  of  transla¬ 
tion  quality,  glossing  provides  more  informative  out¬ 
put  than  Model  4  decoding.  This  result  is  not  partic¬ 
ularly  surprising,  since  we  noticed  that  Model  4  de¬ 
coding  tends  to  leave  out  more  words  than  acceptable. 
Clearly,  this  is  one  area  where  the  translation  model 
has  to  be  improved. 

Test  set  10  is  the  only  set  produced  by  a  system 
using  a  translation  model  trained  on  raw,  unstemmed 
data.  It  is  unclear  whether  the  poor  performance  on 
question  answering  for  this  test  set  is  due  to  a  princi¬ 
pally  worse  translation  quality  or  the  (lack  of)  tenacity 
and  willingness  of  the  test  subject  to  work  her  way 
through  the  system  output. 

All  in  all,  we  were  astonished  by  the  amount  of  in¬ 
formation  that  our  test  subjects  were  able  to  retrieve 
from  the  material  they  received  (the  top  recall  for  the 
question  answering  task  is  64%,  plus  an  additional 


Table  3:  Accuracy  on  question  answering.  The  test 
sets  are  the  same  as  in  Table  2.  Only  questions  con¬ 
cerning  documents  that  were  identified  correctly  were 
considered  in  this  evaluation. 
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14%  partially  correct  answers!).  However,  using  a 
system  such  as  the  one  discussed  in  the  paper  is  not 
an  option  for  actual  information  processing.  Espe¬ 
cially  those  subjects  that  had  to  deal  with  the  output 
of  systems  trained  on  the  smaller  corpus  experienced 
the  task  as  utterly  frustrating  and  would  not  want  to  do 
it  again. 


4  Conclusions 

We  have  reported  on  our  experience  with  rapidly 
building  a  statistical  MT  system  from  scratch.  Within 
ca.  140  translator  hours,  we  were  able  to  create  a  par¬ 
allel  corpus  of  about  1300  sentence  pairs  with  24,000 
tokens  on  the  Tamil  side,  at  an  average  translation  rate 
of  approximately  170  Tamil  words  per  hour. 

Very  clearly,  the  effort  needed  to  create  parallel  data 
is  one  of  the  biggest  obstacles  to  the  rapid  development 
of  statistical  MT  systems  for  new  languages. 

With  the  output  of  a  system  which  uses  a  translation 
model  trained  on  the  small  amount  of  parallel  data  that 
we  created  during  the  course  of  our  experiment,  hu¬ 
man  test  subjects  achieved  a  recall  of  over  50%  on 
the  document  retrieval  task  but  generally  performed 
poorly  on  question  answering  (less  than  20%). 

The  addition  of  an  additional  corpus  of  3,800  sen¬ 
tence  pairs  allowed  us  to  estimate  the  benefits  of  in¬ 
creasing  the  overall  corpus  size  by  roughly  300%. 
Based  on  our  experience  with  translating  the  TamilNet 
corpus,  this  additional  effort  would  require  an  addi¬ 
tional  450  translator  and  36  to  48  post-editor  hours. 

With  the  additional  training  data,  we  were  able  to 
produce  output  that  increased  the  performance  on  our 
evaluation  tasks  (document  retrieval  and  question  an¬ 
swering)  to  up  to  93%  for  document  retrieval  and  64% 
for  question  answering. 

With  respect  to  the  scenario  of  “MT  in  a  month”, 
we  can  now  make  the  following  calculation:  If  we  as¬ 
sume  that  the  average  translator  translates  at  a  rate  of 
170  words/hour  and  is  able  to  spend  6-7  hours  per  day 
on  actual  translations,  then  a  translator  can  translate 
about  1000-1200  words  per  day.  In  order  to  translate  a 
corpus  of  100,000  words  within  one  month  (assuming 
a  five-day  work  week),  we  therefore  need  four  to  five 
full  time  translators.  For  this  effort,  we  can  expect 
a  translation  system  whose  performance  resembles  the 
one  shown  in  our  evaluation. 

This,  of  course,  raises  the  following  questions, 
which  we  are  only  able  to  ask  but  not  to  answer  at  this 
point. 

•  Can  the  translation  model  and  the  algorithms  for 
statistical  training  be  improved  so  that  they  re¬ 
quire  less  data  to  produce  acceptable  results? 

•  Are  there  more  efficient  uses  of  scarce  resources 
(such  as  language  experts  and  translators)  for 
building  a  statistical  (or  any  other)  MT  system 
quickly,  for  example  the  creation  of  less  but  more 
informative  data,  e.g.  a  parallel  corpus  with 
alignments  on  the  word  level,  or  the  compilation 
of  a  glossary/dictionary  of  the  most  frequently 
used  terms? 

•  How  do  the  various  approaches  compare  with  re¬ 
spect  to  the  ratio  of  construction  effort  versus 


performance  improvement  when  the  MT  systems 
are  scaled  up?  One  approach  may  show  rapid 
improvements  initially  but  also  reach  a  plateau 
quickly,  whereas  another  may  show  slow  but 
steady  improvements. 

•  Is  there  any  potential  for  bootstrapping  the  re¬ 
source  creation  process  by  using  knowledge  that 
can  be  extracted  from  little  and  poor  data  to  speed 
up  the  creation  of  more  and  better  data? 

These  are  some  of  the  the  questions  that  will  need 
to  be  addressed  in  future  research  on  Quick  MT. 
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